Exploring Relations of Political Affiliation and Arrests Through the Data Science Pipeline¶

In the United States, our two-party system commonly supports very different policies relating to crime. These policies affect how many people are arrested and put into our criminal correction system, and how long they stay in it. Being in the corrections system has consequences: while incarcerated, many people's lives are put on hold, and both offenders and their families are affected financially and emotionally. After being put in the corrections system, it is much harder to get a job, housing, and financial aid. The corrections system is also a huge expense for the United States, especially in an era of mass incarceration. Because of this, it is important to look at how criminal policies affect how many people are put into our corrections system.

Republicans tend to support more tough-on-crime policies than Democrats do. Many of these policies increase punishments for crimes. Republicans are also more likely to support criminalization of drugs and of many juvenile offenses (such as curfew violations), which could lead to more arrests in states with a Republican lean. At the same time, Republicans tend to favor more relaxed laws on weapons and guns, which could result in fewer arrests related to weapon carrying but also increase the chance of violent crimes involving weapons occurring. Republicans also argue that because Democrats have more relaxed crime policies, more crime occurs in Democratic areas, which could cause Democratic-leaning states to have higher crime. Critics of strict crime policies, however, argue that strict policies push more people into the system, increasing their likelihood to reoffend and be arrested again.

In this tutorial, we will do a simple analysis comparing arrest rates to the party lean of each state. Ideally, this would be the start of a much larger analysis of arrest rates that looks at many more variables.

To look at the relationship between the party lean of a state and its arrest rates, we will go through the five steps of the data science pipeline:

  1. Data Collection and Parsing
  2. Data Management and Representation
  3. Exploratory Data Analysis
  4. Hypothesis Testing
  5. Communication of Insights

To complete the tutorial, you will need the following libraries:

In [19]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import json
import folium
from folium.features import GeoJsonTooltip
import geopandas as gpd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

Data collection and parsing¶

In this stage, we collect data. There are many ways to do this: you can collect your own data through your own study, you can scrape data off the web, or you can download an existing dataset. For this tutorial, we will scrape data for three different things: the party affiliation of each state, the arrest rates for each state, and the population of each state. Since party affiliation, population, and policies change over time, it is important that all the data come from the same time period. For this tutorial, we are looking at 2017.

When scraping data off the web, you will have to parse the HTML to put it in a dataframe. I will walk you through three examples of scraping data from three different websites.

We will use:

  • https://news.gallup.com/poll/226643/2017-party-affiliation-state.aspx for political affiliation of each state
  • https://ucr.fbi.gov/crime-in-the-u.s/2017/crime-in-the-u.s.-2017/topic-pages/tables/table-69 for arrest rates in each state
  • https://www.newgeography.com/content/005837-the-migration-millions-2017-state-population-estimates for population of each state

To put our first dataset into a dataframe, we need to get the HTML and store it with requests.get('url'). We will then use BeautifulSoup to display the HTML contents of the page so we can inspect it and examine the table format. To find the table structure, press Ctrl+F and search for table.

In [20]:
party_affiliation = requests.get('https://news.gallup.com/poll/226643/2017-party-affiliation-state.aspx')
party_affiliation_soup = BeautifulSoup(party_affiliation.content, 'html.parser')
# party_affiliation_soup

Looking at the table structure, we see that all the data is stored in the table rows (tr). Now we want to collect the table rows in a list.

In [21]:
party_affiliation_elements = party_affiliation_soup.findAll('tr') # This collects all the <tr> elements into a list
# party_affiliation_elements

Examining the elements list above, we see that all the State names are in <th> and the Percent Democrat, Percent Republican, and Party Lean are in <td>. We can figure out the index by counting its <td> location in the <tr> element. The first <td> element is at index 0.

In [29]:
# We want to collect data for 4 columns of the dataframe we are creating

# dictionary we will store the data in after we scrape it
party_affiliation_proto_df = { 'State' : [], 'Percent Democrat' : [], 'Percent Republican' : [], 'Party Lean' : []} 

# the data starts at the third index of elements so we will start there
for i in range(3, len(party_affiliation_elements) - 2):
#     print ('********************')
#     print(i)
#     print(party_affiliation_elements[i].findChildren("th")[0].get_text())
#     for j in party_affiliation_elements[i].findChildren("td"):
#         print(j.get_text())
    party_affiliation_proto_df['State'].append(party_affiliation_elements[i].findChildren('th')[0].get_text())
    party_affiliation_proto_df['Percent Democrat'].append(party_affiliation_elements[i].findChildren('td')[0].get_text())
    party_affiliation_proto_df['Percent Republican'].append(party_affiliation_elements[i].findChildren('td')[1].get_text())
    party_affiliation_proto_df['Party Lean'].append(party_affiliation_elements[i].findChildren('td')[4].get_text())

Now we need to sort the proto_df so the states are in alphabetical order; this will make it easier to combine the data with the data from other sources. The sorting algorithm below is bubble sort. As we sort the states, we have to make sure each state's other data is sorted along with it.

In [31]:
n = len(party_affiliation_proto_df['State'])
cols = ['State', 'Percent Democrat', 'Percent Republican', 'Party Lean']
for i in range(n - 1):
    for j in range(0, n - i - 1):
        if party_affiliation_proto_df['State'][j] > party_affiliation_proto_df['State'][j + 1]:
            # swap every column together so each state keeps its own data
            for col in cols:
                vals = party_affiliation_proto_df[col]
                vals[j], vals[j + 1] = vals[j + 1], vals[j]
# party_affiliation_proto_df
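As an aside, there is a shorter way to keep the columns aligned while sorting (a sketch with toy data, not the tutorial's method): compute the alphabetical order of the states once, then reorder every column by it.

```python
# Toy stand-in for party_affiliation_proto_df (the real dict has 51 entries).
proto = {
    'State': ['Georgia', 'Alabama', 'Florida'],
    'Percent Democrat': ['42', '35', '42'],
    'Party Lean': ['Competitive', 'Solid Rep', 'Competitive'],
}
# indices that would put the states in alphabetical order
order = sorted(range(len(proto['State'])), key=lambda i: proto['State'][i])
# apply that same ordering to every column
proto = {col: [vals[i] for i in order] for col, vals in proto.items()}
print(proto['State'])  # ['Alabama', 'Florida', 'Georgia']
```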

Now we can collect our arrest data from the UCR in a similar way.

In [32]:
arrests = requests.get('https://ucr.fbi.gov/crime-in-the-u.s/2017/crime-in-the-u.s.-2017/topic-pages/tables/table-69')
arrests_soup = BeautifulSoup(arrests.content, 'html.parser')
# arrests_soup
In [33]:
arrests_elements = arrests_soup.findAll('tr')
# arrests_elements

This web page is formatted a little differently. There are two table rows per state: the first row contains the state name and information on juvenile arrests, and the second row contains the information for all ages. The information we want is the state name and the all-ages arrest counts. The UCR provides data on many different crimes, so I chose a few that I thought would be interesting to look at through the lens of party affiliation. Many of these crimes relate to tough-on-crime policies, and the political parties have legalized weapons and drugs to different extents. I included embezzlement to see if there could be any difference in white-collar crimes committed across party affiliations. We can again examine the elements list along with the website to figure out the index of each <td> within a <tr> element.

In [34]:
arrests_proto_df = { 'State2' : [], 'Total' : [], 'Violent' : [], 'Property' : [], 'Drug Abuse' : [], 'Curfew and Loitering' : [], 
                    'Embezzlement' : [], 'Weapons' : [], 'Prostitution and Commercialized Vice' : [], 'Vagrancy' : [], 'Population' : []}

# helper to strip newlines and thousands separators from a table cell
def clean(cell):
    return cell.get_text().replace("\n", "").replace(",", "")

# We will start at index 1, as we see this is where the data starts after examining the elements list
for i in range(1, len(arrests_elements)):
    # after examining the data, we see that the all-ages arrest counts are at even-numbered indices
    if (i % 2) == 0:
        tds = arrests_elements[i].findChildren('td')
        arrests_proto_df['Total'].append(clean(tds[0]))
        arrests_proto_df['Violent'].append(clean(tds[1]))
        arrests_proto_df['Property'].append(clean(tds[2]))
        arrests_proto_df['Drug Abuse'].append(clean(tds[20]))
        arrests_proto_df['Curfew and Loitering'].append(clean(tds[30]))
        arrests_proto_df['Embezzlement'].append(clean(tds[14]))
        arrests_proto_df['Weapons'].append(clean(tds[17]))
        arrests_proto_df['Prostitution and Commercialized Vice'].append(clean(tds[18]))
        arrests_proto_df['Vagrancy'].append(clean(tds[27]))
    else:
        arrests_proto_df['State2'].append(arrests_elements[i].findChildren('th')[0].get_text().replace("\n", ""))
        
        

We repeat a similar process to get the population of each state from our third website.

In [35]:
population = requests.get('https://www.newgeography.com/content/005837-the-migration-millions-2017-state-population-estimates')
population_soup = BeautifulSoup(population.content, 'html.parser')
# population_soup
In [36]:
population_elements = population_soup.findAll('tr')
# population_elements

Upon examining the data, we see that this website is structured similarly to the first one.

In [37]:
# upon examining the elements list we see that the state data starts at index 4
for i in range(4, len(population_elements) - 1):
#     print ('********************')
    arrests_proto_df['Population'].append(population_elements[i].findChildren('td')[2].get_text().replace(",", ""))
#     print(population_elements[i].findChildren('td')[0].get_text())
#     print(population_elements[i].findChildren('td')[2].get_text())
In [38]:
# party_affiliation_proto_df

Data Management and Data Representation¶

In this stage, we will put the data into a dataframe, then perform any data transformations and cleaning we need to do. Usually, this stage is also where you deal with missing information; however, we do not have any missing values in our data. Here is an article that talks about different ways to handle missing data: https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
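Even though our data has no gaps, here is a minimal sketch of two common strategies from the linked article, using a hypothetical dataframe (the names `demo`, `dropped`, and `imputed` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing arrest count.
demo = pd.DataFrame({'State': ['A', 'B', 'C'], 'Total': [100.0, np.nan, 300.0]})

dropped = demo.dropna()                                  # strategy 1: drop incomplete rows
imputed = demo.fillna({'Total': demo['Total'].mean()})   # strategy 2: mean imputation
print(len(dropped), imputed.loc[1, 'Total'])  # 2 200.0
```

Dropping rows is safest when few values are missing; imputation keeps the sample size but can bias estimates if the data are not missing at random.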

In [39]:
# combines the two proto_df together
proto_df = dict(party_affiliation_proto_df, **arrests_proto_df)

# puts the proto_df in a pandas dataframe
df = pd.DataFrame.from_dict(proto_df)

df.head(51)
Out[39]:
State Percent Democrat Percent Republican Party Lean State2 Total Violent Property Drug Abuse Curfew and Loitering Embezzlement Weapons Prostitution and Commercialized Vice Vagrancy Population
0 Alabama 35 50 Solid Rep ALABAMA 153285 6005 18366 11267 0 161 1963 2 56 4874747
1 Alaska 31 52 Solid Rep ALASKA 29152 2374 3695 1004 0 77 369 5 2 739795
2 Arizona 40 42 Competitive ARIZONA 277698 12832 34919 33230 1126 457 3090 303 616 7016270
3 Arkansas 36 45 Lean Rep ARKANSAS 123971 4715 12355 17515 310 54 1106 172 315 3004279
4 California 51 30 Solid Dem CALIFORNIA 1093363 109117 105757 212025 998 990 28547 7056 6773 39536653
5 Colorado 46 37 Lean Dem COLORADO 234409 7721 27157 16626 1052 126 2309 521 596 5607154
6 Connecticut 51 32 Solid Dem CONNECTICUT 100211 3844 14074 9174 3 144 1242 183 24 3588184
7 Delaware 45 33 Solid Dem DELAWARE 31549 1989 6156 4015 33 221 338 81 226 961939
8 District of Columbia 70 11 Solid Dem DISTRICT OF COLUMBIA5 17986 133 101 237 0 0 47 0 10 693972
9 Florida 42 39 Competitive FLORIDA6, 7 713085 36251 91063 124487 5 1057 6306 2468 0 20984400
10 Georgia 42 40 Competitive GEORGIA 230640 10843 31080 40608 331 292 3317 379 1285 10429379
11 Hawaii 50 28 Solid Dem HAWAII 35618 1227 3748 2724 217 11 247 120 0 1427538
12 Idaho 31 53 Solid Rep IDAHO 51686 1441 5192 8432 137 43 276 13 15 1716943
13 Illinois 50 33 Solid Dem ILLINOIS6 64552 3823 11653 10915 51 1 4290 141 18 12802023
14 Indiana 41 43 Competitive INDIANA 147199 8345 17546 26364 210 266 2283 305 347 6666818
15 Iowa 42 42 Competitive IOWA 94485 4709 12280 9645 252 79 900 46 17 3145711
16 Kansas 34 48 Solid Rep KANSAS 61144 2198 4958 9594 0 111 742 127 0 2913123
17 Kentucky 41 45 Competitive KENTUCKY 219872 3398 17117 26397 2 390 967 166 70 4454189
18 Louisiana 40 43 Competitive LOUISIANA 166867 9494 29489 25770 246 146 3234 412 243 4684333
19 Maine 47 39 Lean Dem MAINE 40675 780 5578 3409 17 41 130 119 0 1335907
20 Maryland 56 28 Solid Dem MARYLAND 165877 9692 21726 28992 65 77 3215 748 82 6052177
21 Massachusetts 57 26 Solid Dem MASSACHUSETTS 118676 9257 13671 9791 2 122 1328 530 8 6859819
22 Michigan 45 38 Lean Dem MICHIGAN 244417 11798 24692 33090 319 1174 4590 233 139 9962311
23 Minnesota 47 37 Solid Dem MINNESOTA 143702 5769 26025 19281 869 23 2208 359 42 5576606
24 Mississippi 38 45 Lean Rep MISSISSIPPI 78239 1492 8677 10268 90 383 1110 25 105 2984100
25 Missouri 38 45 Lean Rep MISSOURI 228042 9371 31387 39979 644 290 3537 368 1096 6113532
26 Montana 37 51 Solid Rep MONTANA 30824 1262 4965 2872 306 30 85 9 10 1050493
27 Nebraska 35 50 Solid Rep NEBRASKA 5708 161 523 843 16 3 65 1 0 1920076
28 Nevada 42 39 Competitive NEVADA 112004 6506 8946 8923 412 201 1885 2532 1648 2998039
29 New Hampshire 43 40 Competitive NEW HAMPSHIRE 47516 896 3642 7656 11 120 139 79 82 1342795
30 New Jersey 48 33 Solid Dem NEW JERSEY 281631 8577 24807 61989 592 269 3838 900 299 9005644
31 New Mexico 48 34 Solid Dem NEW MEXICO 40552 2708 4340 2377 0 74 179 35 64 2088070
32 New York 52 29 Solid Dem NEW YORK6 259257 12440 43759 69904 0 48 3468 672 618 19849399
33 North Carolina 44 39 Competitive NORTH CAROLINA 240365 11630 34918 25902 18 1185 5147 374 5 10273419
34 North Dakota 28 56 Solid Rep NORTH DAKOTA 39944 855 3787 5646 107 53 313 38 7 755393
35 Ohio 41 42 Competitive OHIO 224423 8084 33068 37022 542 21 3681 1122 208 11658609
36 Oklahoma 35 49 Solid Rep OKLAHOMA 106731 4820 14352 20024 862 408 2207 15 58 3930864
37 Oregon 49 36 Solid Dem OREGON 124851 3687 17578 15682 439 49 2077 364 22 4142776
38 Pennsylvania 46 41 Competitive PENNSYLVANIA 372570 20385 50437 63306 6723 478 5056 1755 298 12805537
39 Rhode Island 48 27 Solid Dem RHODE ISLAND 22707 864 2458 1792 11 102 379 71 1 1059639
40 South Carolina 37 47 Solid Rep SOUTH CAROLINA 153573 6500 23848 32266 32 358 2247 381 484 5024369
41 South Dakota 35 52 Solid Rep SOUTH DAKOTA 63625 1748 3283 8900 171 37 334 33 565 869666
42 Tennessee 35 47 Solid Rep TENNESSEE 350912 14682 38394 47826 1214 664 3081 955 5 6715984
43 Texas 38 41 Competitive TEXAS 745719 30372 78504 136796 2301 794 12954 4482 872 28304596
44 Utah 29 56 Solid Rep UTAH 104741 2252 14526 17556 356 32 896 379 38 3101833
45 Vermont 52 30 Solid Dem VERMONT 13969 664 1683 1137 0 43 28 4 0 623657
46 Virginia 45 38 Lean Dem VIRGINIA 258878 7068 26582 42060 741 1371 4099 617 177 8470020
47 Washington 49 34 Solid Dem WASHINGTON 173344 8472 27055 12027 0 53 1793 650 124 7405743
48 West Virginia 40 44 Competitive WEST VIRGINIA 39296 1832 5077 7277 9 116 342 109 27 1815857
49 Wisconsin 43 41 Competitive WISCONSIN 252142 8023 29429 30781 1767 325 3449 451 809 5795483
50 Wyoming 27 56 Solid Rep WYOMING 27914 612 2374 4612 90 18 78 48 36 579315

Above, we can see that the data matched up by comparing the State and State2 columns. Now we can drop the State2 column since we don't need it anymore.

In [40]:
df = df.drop(['State2'], axis=1)

Here we need to make all our number variables numeric so we can work with them later (math operations, graphs, and machine learning algorithms). We will also convert the arrest counts into a percent of the state population by dividing each arrest count by the population, then multiplying by 100. This allows us to compare between states.

In [41]:
df['Population'] = pd.to_numeric(df['Population'])
df['Percent Democrat'] = pd.to_numeric(df['Percent Democrat'])
df['Percent Republican'] = pd.to_numeric(df['Percent Republican'])

# convert each arrest count to a number, then to a percent of the state population
arrest_cols = ['Total', 'Violent', 'Property', 'Drug Abuse', 'Curfew and Loitering',
               'Embezzlement', 'Weapons', 'Prostitution and Commercialized Vice', 'Vagrancy']
for col in arrest_cols:
    df[col] = pd.to_numeric(df[col])
    df[col] = df[col]/df['Population']*100

df
Out[41]:
State Percent Democrat Percent Republican Party Lean Total Violent Property Drug Abuse Curfew and Loitering Embezzlement Weapons Prostitution and Commercialized Vice Vagrancy Population
0 Alabama 35 50 Solid Rep 3.144471 0.123186 0.376758 0.231130 0.000000 0.003303 0.040269 0.000041 0.001149 4874747
1 Alaska 31 52 Solid Rep 3.940551 0.320900 0.499463 0.135713 0.000000 0.010408 0.049879 0.000676 0.000270 739795
2 Arizona 40 42 Competitive 3.957915 0.182889 0.497686 0.473613 0.016048 0.006513 0.044040 0.004319 0.008780 7016270
3 Arkansas 36 45 Lean Rep 4.126481 0.156943 0.411247 0.583002 0.010319 0.001797 0.036814 0.005725 0.010485 3004279
4 California 51 30 Solid Dem 2.765441 0.275989 0.267491 0.536275 0.002524 0.002504 0.072204 0.017847 0.017131 39536653
5 Colorado 46 37 Lean Dem 4.180534 0.137699 0.484328 0.296514 0.018762 0.002247 0.041180 0.009292 0.010629 5607154
6 Connecticut 51 32 Solid Dem 2.792805 0.107129 0.392232 0.255673 0.000084 0.004013 0.034614 0.005100 0.000669 3588184
7 Delaware 45 33 Solid Dem 3.279730 0.206770 0.639957 0.417386 0.003431 0.022974 0.035137 0.008420 0.023494 961939
8 District of Columbia 70 11 Solid Dem 2.591747 0.019165 0.014554 0.034151 0.000000 0.000000 0.006773 0.000000 0.001441 693972
9 Florida 42 39 Competitive 3.398167 0.172752 0.433956 0.593236 0.000024 0.005037 0.030051 0.011761 0.000000 20984400
10 Georgia 42 40 Competitive 2.211445 0.103966 0.298004 0.389362 0.003174 0.002800 0.031804 0.003634 0.012321 10429379
11 Hawaii 50 28 Solid Dem 2.495065 0.085952 0.262550 0.190818 0.015201 0.000771 0.017303 0.008406 0.000000 1427538
12 Idaho 31 53 Solid Rep 3.010350 0.083928 0.302398 0.491105 0.007979 0.002504 0.016075 0.000757 0.000874 1716943
13 Illinois 50 33 Solid Dem 0.504233 0.029862 0.091025 0.085260 0.000398 0.000008 0.033510 0.001101 0.000141 12802023
14 Indiana 41 43 Competitive 2.207935 0.125172 0.263184 0.395451 0.003150 0.003990 0.034244 0.004575 0.005205 6666818
15 Iowa 42 42 Competitive 3.003613 0.149696 0.390373 0.306608 0.008011 0.002511 0.028610 0.001462 0.000540 3145711
16 Kansas 34 48 Solid Rep 2.098916 0.075452 0.170195 0.329337 0.000000 0.003810 0.025471 0.004360 0.000000 2913123
17 Kentucky 41 45 Competitive 4.936297 0.076288 0.384290 0.592633 0.000045 0.008756 0.021710 0.003727 0.001572 4454189
18 Louisiana 40 43 Competitive 3.562236 0.202676 0.629524 0.550132 0.005252 0.003117 0.069039 0.008795 0.005188 4684333
19 Maine 47 39 Lean Dem 3.044748 0.058387 0.417544 0.255182 0.001273 0.003069 0.009731 0.008908 0.000000 1335907
20 Maryland 56 28 Solid Dem 2.740782 0.160141 0.358978 0.479034 0.001074 0.001272 0.053121 0.012359 0.001355 6052177
21 Massachusetts 57 26 Solid Dem 1.730016 0.134945 0.199291 0.142730 0.000029 0.001778 0.019359 0.007726 0.000117 6859819
22 Michigan 45 38 Lean Dem 2.453417 0.118426 0.247854 0.332152 0.003202 0.011784 0.046074 0.002339 0.001395 9962311
23 Minnesota 47 37 Solid Dem 2.576872 0.103450 0.466682 0.345748 0.015583 0.000412 0.039594 0.006438 0.000753 5576606
24 Mississippi 38 45 Lean Rep 2.621863 0.049998 0.290774 0.344090 0.003016 0.012835 0.037197 0.000838 0.003519 2984100
25 Missouri 38 45 Lean Rep 3.730119 0.153283 0.513402 0.653943 0.010534 0.004744 0.057855 0.006019 0.017927 6113532
26 Montana 37 51 Solid Rep 2.934241 0.120134 0.472635 0.273395 0.029129 0.002856 0.008091 0.000857 0.000952 1050493
27 Nebraska 35 50 Solid Rep 0.297280 0.008385 0.027239 0.043905 0.000833 0.000156 0.003385 0.000052 0.000000 1920076
28 Nevada 42 39 Competitive 3.735909 0.217009 0.298395 0.297628 0.013742 0.006704 0.062874 0.084455 0.054969 2998039
29 New Hampshire 43 40 Competitive 3.538589 0.066726 0.271225 0.570154 0.000819 0.008937 0.010352 0.005883 0.006107 1342795
30 New Jersey 48 33 Solid Dem 3.127272 0.095240 0.275461 0.688335 0.006574 0.002987 0.042618 0.009994 0.003320 9005644
31 New Mexico 48 34 Solid Dem 1.942080 0.129689 0.207847 0.113837 0.000000 0.003544 0.008573 0.001676 0.003065 2088070
32 New York 52 29 Solid Dem 1.306120 0.062672 0.220455 0.352172 0.000000 0.000242 0.017472 0.003385 0.003113 19849399
33 North Carolina 44 39 Competitive 2.339679 0.113205 0.339887 0.252126 0.000175 0.011535 0.050100 0.003640 0.000049 10273419
34 North Dakota 28 56 Solid Rep 5.287844 0.113186 0.501328 0.747426 0.014165 0.007016 0.041435 0.005030 0.000927 755393
35 Ohio 41 42 Competitive 1.924955 0.069339 0.283636 0.317551 0.004649 0.000180 0.031573 0.009624 0.001784 11658609
36 Oklahoma 35 49 Solid Rep 2.715205 0.122619 0.365111 0.509405 0.021929 0.010379 0.056145 0.000382 0.001476 3930864
37 Oregon 49 36 Solid Dem 3.013704 0.088998 0.424305 0.378538 0.010597 0.001183 0.050135 0.008786 0.000531 4142776
38 Pennsylvania 46 41 Competitive 2.909445 0.159189 0.393869 0.494364 0.052501 0.003733 0.039483 0.013705 0.002327 12805537
39 Rhode Island 48 27 Solid Dem 2.142900 0.081537 0.231966 0.169114 0.001038 0.009626 0.035767 0.006700 0.000094 1059639
40 South Carolina 37 47 Solid Rep 3.056563 0.129369 0.474647 0.642190 0.000637 0.007125 0.044722 0.007583 0.009633 5024369
41 South Dakota 35 52 Solid Rep 7.316027 0.200997 0.377501 1.023381 0.019663 0.004255 0.038406 0.003795 0.064967 869666
42 Tennessee 35 47 Solid Rep 5.225027 0.218613 0.571681 0.712122 0.018076 0.009887 0.045876 0.014220 0.000074 6715984
43 Texas 38 41 Competitive 2.634622 0.107304 0.277354 0.483300 0.008129 0.002805 0.045766 0.015835 0.003081 28304596
44 Utah 29 56 Solid Rep 3.376745 0.072602 0.468304 0.565988 0.011477 0.001032 0.028886 0.012219 0.001225 3101833
45 Vermont 52 30 Solid Dem 2.239853 0.106469 0.269860 0.182312 0.000000 0.006895 0.004490 0.000641 0.000000 623657
46 Virginia 45 38 Lean Dem 3.056404 0.083447 0.313836 0.496575 0.008749 0.016187 0.048394 0.007285 0.002090 8470020
47 Washington 49 34 Solid Dem 2.340670 0.114398 0.365325 0.162401 0.000000 0.000716 0.024211 0.008777 0.001674 7405743
48 West Virginia 40 44 Competitive 2.164047 0.100889 0.279593 0.400747 0.000496 0.006388 0.018834 0.006003 0.001487 1815857
49 Wisconsin 43 41 Competitive 4.350664 0.138435 0.507792 0.531121 0.030489 0.005608 0.059512 0.007782 0.013959 5795483
50 Wyoming 27 56 Solid Rep 4.818449 0.105642 0.409794 0.796113 0.015536 0.003107 0.013464 0.008286 0.006214 579315

Exploratory data analysis¶

In this stage, we will plot our data in different ways in order to find patterns that we can form hypotheses about and apply machine learning algorithms to. In this stage, it is important to use good visuals; considering things like color, size, and labels will be important.

In [42]:
# To graph maps, we will need a GeoJSON. For this tutorial, we will download it from here: 
# https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-state-millesime/table/?disjunctive.ste_code&disjunctive.ste_name&sort=year
# This geojson provides geo data for all 50 states and DC
geojson = gpd.read_file(r'georef-united-states-of-america-state-millesime.geojson')

We will now create three maps showing the concentration of Republicans, Democrats, and arrests in each state. This allows us to see visually where the concentrations exist.

In [43]:
# Map showing concentration of Democrats
geojson=geojson[['ste_name','geometry']] # only select the data we will be working with (state name and geometry columns)

us_map = folium.Map(location=[40, -96], zoom_start=3,tiles='openstreetmap')

folium.Choropleth(
            geo_data=r'georef-united-states-of-america-state-millesime.geojson',
            data=df,
            columns=['State', 'Percent Democrat'],  
            key_on='feature.properties.ste_name', # the property containing the state name in the geojson
            fill_color='PuBu',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name='Percent Democrat', #title of the legend
            line_color='black').add_to(us_map) 

us_map
Out[43]:
(interactive choropleth map of Percent Democrat by state; trust the notebook to render it)
In [44]:
# Map showing concentration of Republicans
geojson=geojson[['ste_name','geometry']] # only select the data we will be working with (state name and geometry columns)

us_map = folium.Map(location=[40, -96], zoom_start=3,tiles='openstreetmap')

folium.Choropleth(
            geo_data=r'georef-united-states-of-america-state-millesime.geojson',
            data=df,
            columns=['State', 'Percent Republican'],  
            key_on='feature.properties.ste_name',
            fill_color='PuRd',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name='Percent Republican', #title of the legend
            line_color='black').add_to(us_map) 

us_map
Out[44]:
(interactive choropleth map of Percent Republican by state; trust the notebook to render it)
In [45]:
# concentration of arrests
geojson=geojson[['ste_name','geometry']] # only select the data we will be working with (state name and geometry columns)

us_map = folium.Map(location=[40, -96], zoom_start=3,tiles='openstreetmap')

folium.Choropleth(
            geo_data=r'georef-united-states-of-america-state-millesime.geojson',
            data=df,
            columns=['State', 'Total'],  
            key_on='feature.properties.ste_name',
            fill_color='Greys',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name='Percent Arrests', 
            line_color='black').add_to(us_map) 

us_map
Out[45]:
(interactive choropleth map of total arrest percentage by state; trust the notebook to render it)

It's hard to see patterns here, but you can see that some states with a high concentration of Republicans also have high concentrations of arrests. For example, South Dakota is very dark on the map of arrests and also very dark on the map of Republicans. Now we will make some more graphs that display trends better.

In [46]:
# Group by Party Lean so we can graph the mean arrest percentage for each lean
new_df = df.groupby('Party Lean')['Total'].mean()
new_df = new_df.reindex(['Solid Dem' , 'Lean Dem', 'Competitive', 'Lean Rep', 'Solid Rep'])
new_df
Out[46]:
Party Lean
Solid Dem      2.349331
Lean Dem       3.183776
Competitive    3.125035
Lean Rep       3.492821
Solid Rep      3.632436
Name: Total, dtype: float64
In [47]:
ax = new_df.plot(kind='bar', figsize=(10,6), fontsize=12);
ax.set_alpha(0.8)
ax.set_title("Average Percent of Arrests Based on Party Lean of State", fontsize=22)
ax.set_ylabel("Percent of arrests", fontsize=12);
plt.show()

We see that there seems to be a trend: as the party lean becomes more Republican, the percent of arrests seems to increase. Now we will look at scatter plots with regression lines to see whether there is a trend between percent Republican/Democrat and arrest percentages.

In [48]:
# scatter plot for percent Democrat v percent arrests
dem_v_crime = df.plot.scatter(x = 'Percent Democrat', y = 'Total', s = 100);

m, b = np.polyfit(df['Percent Democrat'], df['Total'], 1)
plt.plot(df['Percent Democrat'], m*df['Percent Democrat']+b, color='red') # adds regression line
print('Slope: ' + str(m))
plt.title('Percent of Democrats v Arrests in States')
plt.xlabel('Percent of Democrats')
plt.ylabel('Total Arrests - measured in percent of arrests compared to state population')
fig = plt.gcf()
fig.set_size_inches(10, 5)
Slope: -0.06143752311474804
In [49]:
# scatter plot for percent Republican v percent arrests
dem_v_crime = df.plot.scatter(x = 'Percent Republican', y = 'Total', s = 100);

m, b = np.polyfit(df['Percent Republican'], df['Total'], 1)
plt.plot(df['Percent Republican'], m*df['Percent Republican']+b, color='red') # adds regression line
print('Slope: ' + str(m))
plt.title('Percent of Republican v Arrests in States')
plt.xlabel('Percent of Republican')
plt.ylabel('Total Arrests - measured in percent of arrests compared to state population')
fig = plt.gcf()
fig.set_size_inches(10, 5)
Slope: 0.056702718754141464

From these two graphs there seems to be a slight connection between percent Republican/Democrat and percent of arrests. The data seems to be more tightly clustered around the regression line for percent Republican than for percent Democrat.
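One way to put a number on "how tightly clustered" is the Pearson correlation coefficient. A minimal sketch with made-up numbers (on the real data you could compute, e.g., np.corrcoef(df['Percent Republican'], df['Total'])[0, 1]):

```python
import numpy as np

# Toy stand-ins for the Percent Republican and Total columns.
pct_rep = np.array([30.0, 40.0, 50.0, 56.0])
total = np.array([2.7, 3.0, 3.6, 3.6])

# Pearson r: +1 is a perfect positive linear relationship, 0 is none.
r = np.corrcoef(pct_rep, total)[0, 1]
print(r > 0)  # positive correlation in this toy data
```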

Here is a website showing other types of graphs you could use in data visualization: https://www.analyticsvidhya.com/blog/2021/12/12-data-plot-types-for-visualization/

Hypothesis testing¶

Now that we see a bit of a trend between political affiliation and arrest rates, we need to create a null hypothesis to test. Our null hypothesis will be:

Null Hypothesis: There is no difference in the percent of arrests based on the party lean of the state

At the conclusion of this phase, we will either reject or fail to reject the null hypothesis. I will walk you through some ways to test the hypothesis and what the qualifications will be to reject the null hypothesis.

Since the relationship looked stronger when comparing percent Republican to arrests, we will focus on testing that relationship.

We will start by fitting a linear regression model between percent Republican and percent total arrests in the state and seeing how well it fits. The linear regression model gives the same prediction as the regression line above; note that the slopes match. The model also gives us a score: the R-squared value, which measures how much of the variation in arrests the model explains. What counts as a good R-squared varies by field, but values roughly between 0.5 and 0.9 are often considered acceptable, and the higher the score, the better the fit.
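As a side note, for a simple one-predictor regression like ours, the R-squared score is just the squared Pearson correlation between the two columns. A quick self-contained check on synthetic data (not our `df`; the numbers here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(30, 70, size=51)           # hypothetical percent-style predictor
y = 0.05 * x + rng.normal(0, 1, size=51)   # noisy linear relationship

model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)      # the same value .score() gives us below
r = np.corrcoef(x, y)[0, 1]                # Pearson correlation between x and y

print(round(r2, 3), round(r ** 2, 3))      # the two values match
```

This is a handy sanity check: if `.score()` ever disagrees with the squared correlation for a one-predictor model, something is wrong with how the data was fed in.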

In [50]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Total']).reshape(-1, 1)

regrTotal = LinearRegression()
regrTotal.fit(X, y)

print('Score: ' + str(regrTotal.score(X, y)))
print('Slope: ' + str(regrTotal.coef_[0]))
Score: 0.1849858037416766
Slope: [0.05670272]

The score is only 0.18, which is not strong enough evidence to reject our null hypothesis. To continue our testing, we will explore whether there is a stronger relationship between percent Republican and arrests for more specific crime categories.

In [51]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Violent']).reshape(-1, 1)

regrViolent = LinearRegression()
regrViolent.fit(X, y)

print('Score: ' + str(regrViolent.score(X, y)))
print('Slope: ' + str(regrViolent.coef_[0]))

y_pred = regrViolent.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Violent Crime Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.021129945396548666
Slope: [0.00097035]

This score (0.02) is even lower, so it is also not enough to reject the null hypothesis. We can also see that the slope (0.00097) is very close to 0, another indicator that the relationship is weak.

In [52]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Property']).reshape(-1, 1)

regrProperty = LinearRegression()
regrProperty.fit(X, y)

print('Score: ' + str(regrProperty.score(X, y)))
print('Slope: ' + str(regrProperty.coef_[0]))

y_pred = regrProperty.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Property Crime Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.17123305618061757
Slope: [0.00617072]

The slope (0.0062) is much steeper here, though a steeper slope alone does not mean a stronger fit between percent property crime arrests and percent Republican. The R-squared value (0.17) is still too low to reject the null hypothesis.

In [53]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Drug Abuse']).reshape(-1, 1)

regrDrug = LinearRegression()
regrDrug.fit(X, y)

print('Score: ' + str(regrDrug.score(X, y)))
print('Slope: ' + str(regrDrug.coef_[0]))

y_pred = regrDrug.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Drug Abuse Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.2391391623509247
Slope: [0.01133622]

So far, this is the highest R-squared value (0.24) we have seen; however, it is still too low to reject the null hypothesis.

In [54]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Curfew and Loitering']).reshape(-1, 1)

regrCurfew = LinearRegression()
regrCurfew.fit(X, y)

print('Score: ' + str(regrCurfew.score(X, y)))
print('Slope: ' + str(regrCurfew.coef_[0]))

y_pred = regrCurfew.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Curfew and Loitering Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.08359687854225373
Slope: [0.00032706]

The R-squared value (0.08) is again very low, so we still cannot reject the null hypothesis.

In [55]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Embezzlement']).reshape(-1, 1)

regrEmbezzlement = LinearRegression()
regrEmbezzlement.fit(X, y)

print('Score: ' + str(regrEmbezzlement.score(X, y)))
print('Slope: ' + str(regrEmbezzlement.coef_[0]))

y_pred = regrEmbezzlement.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Embezzlement Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.012056180315819565
Slope: [5.59468046e-05]

This is the lowest R-squared value (0.01) we have seen so far. So, we cannot reject the null hypothesis here either.

In [56]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Weapons']).reshape(-1, 1)

regrWeapons = LinearRegression()
regrWeapons.fit(X, y)

print('Score: ' + str(regrWeapons.score(X, y)))
print('Slope: ' + str(regrWeapons.coef_[0]))

y_pred = regrWeapons.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Weapons Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.007223502876974708
Slope: [0.00016128]

This R-squared value (0.007) is even lower, far too low to reject the null hypothesis.

In [57]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Prostitution and Commercialized Vice']).reshape(-1, 1)

regrProstitution = LinearRegression()
regrProstitution.fit(X, y)

print('Score: ' + str(regrProstitution.score(X, y)))
print('Slope: ' + str(regrProstitution.coef_[0]))

y_pred = regrProstitution.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Prostitution and Commercialized Vice Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.004371009088417255
Slope: [-8.64995383e-05]

This R-squared score (0.004) is again very low, so we cannot reject the null hypothesis.

In [58]:
# prepares the x and y variables for the linear regression model
X = np.array(df['Percent Republican']).reshape(-1, 1)
y = np.array(df['Vagrancy']).reshape(-1, 1)

regrVagrancy = LinearRegression()
regrVagrancy.fit(X, y)

print('Score: ' + str(regrVagrancy.score(X, y)))
print('Slope: ' + str(regrVagrancy.coef_[0]))

y_pred = regrVagrancy.predict(X)
plt.scatter(X, y, color ='b')
plt.plot(X, y_pred, color ='k')
plt.title('Percent Republican v Vagrancy Arrests in States')
plt.xlabel('Percent Republicans')
plt.ylabel('Percent Arrests')
  
plt.show()
Score: 0.01151665969470439
Slope: [0.0001456]

Again, this R-squared value (0.01) is too low to reject the null hypothesis. It appears that simple linear regression on the arrest rates is not giving us evidence to reject the null hypothesis.

Now let's see if we can predict the party lean of a state based on the percent of arrests in all the categories we looked at previously. For this we will use a classification machine learning algorithm, since party lean is a categorical variable. There are many algorithms to choose from, such as Decision Trees, Random Forest, Naive Bayes, and k-nearest neighbors. In this tutorial we will look at Decision Trees and Random Forest. To evaluate our models we will look at the accuracy of the test data predictions; typically we want at least 90% accuracy.

In [59]:
# Here we will split our data into training and testing data. We will use this for our Decision Tree and Random Forest.
# We will use a traditional 70% training and 30% testing split. We will feed the train_test_split function a dataframe
# including all the percents of arrests as the predictor values. We will feed the party lean column as the value to predict.
X_train, X_test, y_train, y_test = train_test_split(df[['Total', 'Violent', 'Property', 'Drug Abuse', 'Curfew and Loitering', 'Embezzlement', 'Weapons', 'Prostitution and Commercialized Vice', 'Vagrancy']], df['Party Lean'], shuffle=True, test_size=0.3, random_state=42)

A decision tree classifies an observation by branching at internal nodes based on the predictor variables; the leaf that a path ends at gives the predicted class. We will start by looking at the decision tree.
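If you want to inspect the branching described above, scikit-learn's `export_text` prints a trained tree's learned rules. A small sketch on made-up data (the feature names and values here are invented for illustration, not taken from our `df`):

```python
# Train a tiny decision tree on synthetic data and print its splits.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(40, 2))   # two fake arrest-rate features
y_demo = (X_demo[:, 0] > 5).astype(int)     # simple rule for the tree to learn

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X_demo, y_demo)
# Shows the node-by-node rules, e.g. which feature and threshold each split uses
print(export_text(tree, feature_names=["Drug Abuse", "Weapons"]))
```

Reading the printed rules is a good way to check which arrest categories a tree actually relies on at each branch.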

In [60]:
# initialize the model with standard parameters
dt_model = DecisionTreeClassifier(criterion="entropy")
# train the model
dt_model.fit(X_train,y_train)

# Get accuracy of the training data
y_train_pred = dt_model.predict(X_train)
a_dt_train = accuracy_score(y_train, y_train_pred)

# Get accuracy of the test data
y_test_pred = dt_model.predict(X_test)
a_dt_test = accuracy_score(y_test, y_test_pred)

print("Training data accuracy is " +  repr(a_dt_train) + " and test data accuracy is " + repr(a_dt_test))
Training data accuracy is 1.0 and test data accuracy is 0.3125

The test data accuracy is only 31.25%. This is not good, and we cannot reject our null hypothesis from this. Notice also that our training data accuracy is 100%. This can be a sign of overfitting, meaning the model fits the training data so closely that it fails to generalize to unseen test data. When a model looks overfitted, you should experiment with the train/test split, with which variables you feed in, and with constraining the model itself.
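One common way to constrain the model is to cap the tree's depth with `max_depth`. A self-contained sketch on synthetic data (not our arrests data) showing an unconstrained tree memorizing its noisy training set while a depth-limited tree stays simpler:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # unconstrained
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

# The deep tree hits 100% on training data (memorization); compare test scores
print("deep:    train", deep.score(X_tr, y_tr), "test", round(deep.score(X_te, y_te), 3))
print("shallow: train", round(shallow.score(X_tr, y_tr), 3), "test", round(shallow.score(X_te, y_te), 3))
```

With only 51 rows in our real dataset, constraints like this matter even more, since a tree can memorize such a small training set almost instantly.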

Now let's look at Random Forest. Random Forest trains many decision trees, each on a random sample of the data and random subsets of the variables, and then predicts by combining the trees' votes, with the majority class winning.

In [61]:
# initializes the model with 10 trees
rf_model = RandomForestClassifier(n_estimators=10, oob_score=True, n_jobs=1, criterion="entropy")
rf_model.fit(X_train, y_train)

# Get accuracy of the training data
rf_train_pred = rf_model.predict(X_train)
a_rf_train = accuracy_score(y_train, rf_train_pred)

# Get accuracy of the test data
rf_test_pred = rf_model.predict(X_test)
a_rf_test = accuracy_score(y_test, rf_test_pred)

print("Training data accuracy is " +  repr(a_rf_train) + " and test data accuracy is " + repr(a_rf_test))
Training data accuracy is 1.0 and test data accuracy is 0.25

The test data accuracy is only 25% which is again not high enough to reject the null hypothesis.
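Since we built the forest with `oob_score=True`, we could also read off an out-of-bag (OOB) accuracy estimate: each tree is evaluated on the bootstrap samples it never saw, giving a built-in validation score without a separate hold-out set. A minimal self-contained sketch on synthetic data (not our `df`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # easy boundary for the forest to learn

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
# Accuracy estimated from samples each tree never trained on
print("OOB accuracy:", round(rf.oob_score_, 3))
```

On a dataset as small as ours, the OOB estimate is a useful complement to a 30% test split, which leaves only about 16 states for testing.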

Insights¶

While we did see some interesting patterns while exploring this data, we did not find enough evidence to reject the null hypothesis. So, we fail to reject our null hypothesis that there is no difference in arrest percentages based on the party lean of the state. Note that failing to reject does not prove that political affiliation has no influence on arrest rates; it only means we did not find sufficient evidence of a relationship on its own. This makes sense, since many factors influence arrest rates. Those factors may include political affiliation, but this would need to be studied further.

The data we have also has some flaws. For one thing, our dataset size was limited to the 50 states and DC. Perhaps a future study could look at the county level. Another issue is that we didn't control for many other variables that could drive arrest rates. Further studies should examine more data, such as victimization data (the NCVS is a good source for this), poverty rates, the percent of people living in cities, specific policies, and political affiliation over time. If we control for these variables, we may find a stronger relationship.
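As a rough sketch of what "controlling for" other variables could look like, here is a multiple regression on invented data. The control columns (`'Poverty Rate'`, `'Percent Urban'`) and all values are hypothetical placeholders, not measurements from our dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
demo = pd.DataFrame({
    'Percent Republican': rng.uniform(25, 70, 51),
    'Poverty Rate': rng.uniform(7, 20, 51),    # hypothetical control variable
    'Percent Urban': rng.uniform(40, 95, 51),  # hypothetical control variable
})
# Fabricated outcome: driven mostly by poverty, only weakly by party lean
demo['Total'] = (0.02 * demo['Percent Republican']
                 + 0.1 * demo['Poverty Rate']
                 + rng.normal(0, 0.3, 51))

X = demo[['Percent Republican', 'Poverty Rate', 'Percent Urban']]
model = LinearRegression().fit(X, demo['Total'])
# Each coefficient is the association with arrests holding the other columns fixed
print(dict(zip(X.columns, model.coef_.round(3))))
```

The point of the multivariate fit is that the party-lean coefficient now reflects the association after holding the controls fixed, which is exactly what a follow-up study with real poverty and urbanization data would want to estimate.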

We also saw an interesting pattern, particularly between percent Republican and percent drug arrest rates, that would be worth studying further.

For more information about different machine learning algorithms you can start by looking at this website: https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
